A Checkpointing-Recovery Scheme for Optimistically Synchronized Parallel Computations
Abstract
This paper presents a checkpointing-recovery scheme for optimistically synchronized parallel computations. The scheme relies on a checkpointing protocol, namely hybrid state saving, which embeds both sparse and incremental state saving modes, and on a state recovery procedure that embeds both forward and backward recovery modes. The scheme generalizes many previous solutions, each of which can be obtained as a particular instance by selecting appropriate values for the checkpointing protocol parameters. We also present two regulating algorithms that tune the checkpointing protocol parameters online, so that the protocol reacts to dynamic rollback behavior. The scheme is compared to previous solutions through a case study in the context of optimistically synchronized parallel discrete event simulation. The comparison has been carried out using a classical benchmark in eight different configurations. The obtained data show that our scheme allows faster execution and, in addition, keeps the amount of memory used for recording state information quite low.
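As a rough illustration of the ideas named in the abstract, the following Python sketch shows a toy logical process that takes a full checkpoint every few events (sparse state saving), optionally logs one undo record per event (incremental state saving), and recovers either by restoring a checkpoint and replaying events (forward recovery) or by undoing logged changes (backward recovery). It is not the paper's protocol: the class `ToyProcess`, the function `demo`, the dict-based state, the parameter names, and the rule that ties the recovery mode to the `incremental` flag are all assumptions made for illustration.

```python
import copy


class ToyProcess:
    """Hypothetical logical process whose state is a dict of counters."""

    def __init__(self, checkpoint_interval=3, incremental=False):
        self.state = {"a": 0, "b": 0}
        self.interval = checkpoint_interval  # events between full checkpoints
        self.incremental = incremental       # also log an undo record per event
        self.processed = []                  # (timestamp, key, delta), timestamp order
        self.checkpoints = [(0, copy.deepcopy(self.state))]  # (events done, state)
        self.undo_log = []                   # (key, old value), one per event

    def execute(self, timestamp, key, delta):
        if self.incremental:
            # Incremental mode: remember only the value about to be overwritten.
            self.undo_log.append((key, self.state[key]))
        self.state[key] += delta
        self.processed.append((timestamp, key, delta))
        if len(self.processed) % self.interval == 0:
            # Sparse mode: full copy of the state every `interval` events.
            self.checkpoints.append((len(self.processed), copy.deepcopy(self.state)))

    def rollback(self, timestamp):
        """Restore the state as it was before any event with time >= timestamp."""
        keep = sum(1 for t, _, _ in self.processed if t < timestamp)
        # Drop checkpoints taken after the rollback point.
        while self.checkpoints[-1][0] > keep:
            self.checkpoints.pop()
        if self.incremental:
            # Backward recovery: undo the rolled-back events, newest first.
            for _ in range(len(self.processed) - keep):
                key, old = self.undo_log.pop()
                self.state[key] = old
            mode = "backward"
        else:
            # Forward recovery: reload the latest valid checkpoint, then replay.
            done, saved = self.checkpoints[-1]
            self.state = copy.deepcopy(saved)
            for _, key, delta in self.processed[done:keep]:
                self.state[key] += delta
            mode = "forward"
        del self.processed[keep:]
        return mode


def demo(incremental):
    lp = ToyProcess(checkpoint_interval=3, incremental=incremental)
    events = [(1, "a", 1), (2, "b", 2), (3, "a", 3), (4, "b", 4), (5, "a", 5)]
    for ts, key, delta in events:
        lp.execute(ts, key, delta)
    mode = lp.rollback(5)  # pretend a straggler with timestamp 5 arrived
    return mode, lp.state
```

Calling `demo(False)` and `demo(True)` rolls back the same event sequence through the two recovery paths. In this toy model a larger `checkpoint_interval` saves memory but lengthens the replay during forward recovery, which is the kind of trade-off the abstract says the regulating algorithms tune.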
Related resources
On the Trade-off between Time and Space in Optimistic Parallel Discrete-Event Simulation
Optimistically synchronized parallel discrete-event simulation is based on the use of communicating sequential processes. Optimistic synchronization means that the processes execute under the assumption that synchronization is fortuitous. Periodic checkpointing of the state of a process allows the process to roll back to an earlier state when synchronization errors occur. This paper examines th...
An Analysis of the Efficiency of Optimistically Synchronized Parallel Simulators
In optimistically synchronized parallel simulators, logical processes execute events greedily, and recovery from timestamp order violations is based on rollback. These types of simulators have shown the potential to exploit a high degree of parallelism; however, they may be inefficient due to a possibly unacceptable percentage of CPU time spent executing events that are rolled back (i.e., non prod...
An Enhanced MSS-based checkpointing Scheme for Mobile Computing Environment
Mobile computing systems are made up of different components among which Mobile Support Stations (MSSs) play a key role. This paper proposes an efficient MSS-based non-blocking coordinated checkpointing scheme for mobile computing environment. In the scheme suggested nearly all aspects of checkpointing and their related overheads are forwarded to the MSSs and as a result the workload of Mobile ...
Checkpointing and recovery in a transaction-based DSM operating system
Reliability of cluster systems can be improved by periodically saving checkpoints in stable storage. In case of an error, backward error recovery can restart the cluster from the last checkpoint, thus avoiding a fallback to the initial state. Different strategies originally developed for message-passing systems have been adapted for Distributed Shared Memory (DSM) systems. However, it is no...
Manetho: Transparent Rollback-Recovery with Low Overhead, Limited Rollback, and Fast Output Commit
Manetho is a new transparent rollback-recovery protocol for long-running distributed computations. It uses a novel combination of antecedence graph maintenance, uncoordinated checkpointing, and sender-based message logging. Manetho simultaneously achieves the advantages of pessimistic message logging, namely limited rollback and fast output commit, and the advantage of optimistic message logging, nam...
Publication date: 2007